First, we need a graph. A graph is just a bunch of objects, typically called nodes which are connected to each other. The connections are called edges, and can have a direction, like Tom owes money to Andy, or can be non-directional, like a road connecting two cities.

A graph is a collection of nodes and edges. The BFS algo can tell us if two nodes are connected, and finds the shortest path b/w them.


In [226]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from geopy.distance import great_circle
from collections import deque

Date to make a graph with

To make the graph interesting, I am using the airlines and route information database from openflights.org.

So we have two datasets, one about airports, and one about routes b/w airports. For the purpose of the graph, the airports are the nodes, and the routes are the edges b/w them. The routes have a direction, and the cost or weight of the route is the distance.

Nodes aka Airports


In [500]:
cols = ["Airport ID", "Name", "City", "Country", "IATA", "ICAO", "Latitude", "Longitude", "Altitude",
        "Timezone", "DST", "Tz database", "Type", "Source"]
airports = pd.read_csv("data/airports.dat", header=None, names=cols, index_col=0)
print(f"There are {airports.shape[0]} airports")
airports.head()


There are 7184 airports
Out[500]:
Name City Country IATA ICAO Latitude Longitude Altitude Timezone DST Tz database Type Source
Airport ID
1 Goroka Airport Goroka Papua New Guinea GKA AYGA -6.081690 145.391998 5282 10 U Pacific/Port_Moresby airport OurAirports
2 Madang Airport Madang Papua New Guinea MAG AYMD -5.207080 145.789001 20 10 U Pacific/Port_Moresby airport OurAirports
3 Mount Hagen Kagamuga Airport Mount Hagen Papua New Guinea HGU AYMH -5.826790 144.296005 5388 10 U Pacific/Port_Moresby airport OurAirports
4 Nadzab Airport Nadzab Papua New Guinea LAE AYNZ -6.569803 146.725977 239 10 U Pacific/Port_Moresby airport OurAirports
5 Port Moresby Jacksons International Airport Port Moresby Papua New Guinea POM AYPY -9.443380 147.220001 146 10 U Pacific/Port_Moresby airport OurAirports

Step one is is to get rid of all the info we don't need for our graph.

So for Airports, I am just keeping:

  • Airport ID, Unique OpenFlights identifier
  • City
  • Country
  • Latitude/Longitude

In [501]:
# dropping the columns we don't need
keep_cols = ["City", "Country", "Latitude", "Longitude"]
airports = airports[keep_cols]
print(f"There are {airports.shape[0]} airports")
airports.head()


There are 7184 airports
Out[501]:
City Country Latitude Longitude
Airport ID
1 Goroka Papua New Guinea -6.081690 145.391998
2 Madang Papua New Guinea -5.207080 145.789001
3 Mount Hagen Papua New Guinea -5.826790 144.296005
4 Nadzab Papua New Guinea -6.569803 146.725977
5 Port Moresby Papua New Guinea -9.443380 147.220001

In [502]:
# getting rid of null values
airports = airports.dropna(axis=0)
airports.shape


Out[502]:
(7140, 4)

Edges, aka routes flown

The second dataset is the routes flown by airlines.

As of January 2012, the OpenFlights/Airline Route Mapper Route Database contains 59036 routes between 3209 airports on 531 airlines spanning the globe, as shown in the map above.


In [331]:
cols = ["Airline", "Airline ID", "Source airport", "Source Airport ID", "Destination airport",
        "Dest Airport ID", "Codeshare", "Stops", "Equipment"]
routes = pd.read_csv("data/routes.dat", header=None, names=cols)
print(f"There are {routes.shape[0]} routes")
routes.head()


There are 67663 routes
Out[331]:
Airline Airline ID Source airport Source Airport ID Destination airport Dest Airport ID Codeshare Stops Equipment
0 2B 410 AER 2965 KZN 2990 NaN 0 CR2
1 2B 410 ASF 2966 KZN 2990 NaN 0 CR2
2 2B 410 ASF 2966 MRV 2962 NaN 0 CR2
3 2B 410 CEK 2968 KZN 2990 NaN 0 CR2
4 2B 410 CEK 2968 OVB 4078 NaN 0 CR2

We just need:

  • Airline ID: Unique OpenFlights identifier for airline
  • Dest Airport ID: Unique OpenFlights identifier for dest airport

In [332]:
# first drop all rows where stops aren't 0, as we only want direct connections
routes = routes[routes["Stops"] == 0]

keep_cols = ["Source Airport ID", "Dest Airport ID"]
routes = routes[keep_cols]
print(f"There are {routes.shape[0]} routes")
routes.head(10)


There are 67652 routes
Out[332]:
Source Airport ID Dest Airport ID
0 2965 2990
1 2966 2990
2 2966 2962
3 2968 2990
4 2968 4078
5 4029 2990
6 4029 6969
7 4029 \N
8 4029 6160
9 6156 2952

Now there are some values which aren't numbers, so we need to clean them up, as you can see in row 7 above.


In [333]:
def make_int_or_null(x):
    """returns int or np.nan if can't return int"""
    try:
        return int(x)
    except:
        return np.nan
    
routes = routes.applymap(make_int_or_null)

In [334]:
print(f"There are {routes.shape[0]} routes before dropping null values")
routes = routes.dropna(axis=0)
print(f"there are {routes.shape[0]} after dropping null values")


There are 67652 routes before dropping null values
there are 67229 after dropping null values

In [335]:
routes.head()


Out[335]:
Source Airport ID Dest Airport ID
0 2965.0 2990.0
1 2966.0 2990.0
2 2966.0 2962.0
3 2968.0 2990.0
4 2968.0 4078.0

Now we need to be able to get the distance b/w two airports to be able to assign a weight in our graphs.


In [368]:
def route_distance(edge):
    """takes a route as a pandas row, and returns distance b/w them"""
    src_airport = airports.iloc[int(edge["Source Airport ID"])]
    dest_airport = airports.iloc[int(edge["Dest Airport ID"])]
    src_lat = src_airport["Latitude"]
    src_long = src_airport["Longitude"]
    dest_lat = dest_airport["Latitude"]
    dest_long = dest_airport["Longitude"]
    
    src_loc = (float(src_lat), float(src_long))
    dest_loc = (float(dest_lat), float(dest_long))
    
    return great_circle(src_loc, dest_loc).km

#checking algo
print(route_distance(routes.iloc[100]))


204.03946015838426

so the data seems ready to go. We now have two pandas dataframes, airports and routes, and a function which returns the distance b/w airports.

the actual graph

there are many ways to make a graph. One way is to build out all the nodes and connections. So here I'll initiate the graph as a dictionary of nodes, with the data being the edges.

A blank graph where the key is each airport:


In [513]:
graph = {}
for i, row in airports.iterrows():
    graph[i] = []

Now going through each route and adding (dest_airport, distance) to each src_airport in the graph:


In [514]:
from tqdm import tqdm
for i, row in tqdm(routes.iterrows(), total=len(routes), mininterval=0.5):
    try:
        src_airport = airports.iloc[int(row["Source Airport ID"])]
        dest_airport = airports.iloc[int(row["Dest Airport ID"])]
    except:
        pass
    
    dist = great_circle((src_airport["Latitude"], src_airport["Longitude"]),
                       (dest_airport["Latitude"],dest_airport["Longitude"]) ).km
    
    n = int(row["Source Airport ID"])
    d = int(row["Dest Airport ID"])
    if n in graph.keys():
        graph[n].append((d, int(dist)))


100%|██████████| 67229/67229 [01:22<00:00, 813.66it/s]

Now say I want to find out the graph for Karachi. Karachi has three airports, so only looking at the first one for now...


In [515]:
find = airports["City"] == "Karachi"
airports[find]


Out[515]:
City Country Latitude Longitude
Airport ID
2206 Karachi Pakistan 24.906500 67.160797
2213 Karachi Pakistan 24.893600 66.938797
11911 Karachi Pakistan 24.874201 67.118500

So I can see all the cities it's connected to, though since some cities have multiple airports they show up twice. So how do I deal with that? I can ignore multiple airports as the distance is the same, though a more complex graph would take into account the different $$ value.


In [543]:
for i, t in enumerate(graph[2206]):
    if t[0] in graph.keys():
        print(i, airports.loc[t[0]]["City"], t[1])
    else:
        continue


0 Abu Dhabi 1559
1 Chengdu 6496
2 Bangkok 3409
3 Dubai 1507
4 Abu Dhabi 1559
5 Dubai 1507
6 Sharjah 1415
7 Bahrain 7065
8 Bahawalpur 10912
9 Islamabad 626
10 Lahore 335
11 Faisalabad 1588
12 Multan 89
13 Peshawar 656
14 Quetta 621
15 Tehran 11793
16 Colombo 3970
17 Dubai 1507
18 Islamabad 626
19 Lahore 335
20 Multan 89
21 Peshawar 656
22 Quetta 621
23 Abu Dhabi 1559
24 Bahawalpur 10912
25 Mumbai 4096
26 Dhaka 5079
27 Dera Ghazi Khan 18538
28 Delhi 4405
29 Dammam 6480
30 Dubai 1507
31 Gwadar 37
32 Islamabad 626
33 Jeddah 7265
34 Kathmandu 6126
35 Kuala Lumpur 9354
36 Lahore 335
37 London 9212
38 Faisalabad 1588
39 Muscat 1514
40 Madinah 7420
41 Moenjodaro 61
42 Multan 89
43 Peshawar 656
44 Panjgur 577
45 Riyadh 6449
46 Rahim Yar Khan 679
47 Sharjah 1415
48 Sukkur 246
49 Turbat 8236
50 Quetta 621
51 Toronto 10578
53 Dammam 6480
54 Jeddah 7265
55 Riyadh 6449
56 Bangkok 3409
57 Muscat 1514
58 Istanbul 13782
59 Colombo 3970
60 Muscat 1514
61 Jeddah 7265

In [525]:
airports.shape


Out[525]:
(7140, 4)

In [ ]: